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Abstract 

We give a detailed description of the so-called Polynomial Hybrid Monte 
Carlo (PHMC) algorithm. The effects of the correction factor, which is 
introduced to render the algorithm exact, are discussed, stressing their 
relevance for the statistical fluctuations and (almost) zero mode contribu¬ 
tions to physical observables. We also investigate rounding-error effects 
and propose several ways to reduce memory requirements. 
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1 Introduction 


Although lattice QCD [|l[ has nowadays reached a relatively mature age, precise 
quantitative results -at least in the full theory- are still rare. One of the main 
reasons is certainly that numerical simulations of lattice QCD, including the ef¬ 
fects of dynamical quarks, are still very demanding and computer time consuming 
(for reviews of dynamical fermion algorithms see P, |^). Efforts to improve on 
this situation are therefore highly desirable. 

In this paper we extend the discussion of the so-called Polynomial Hybrid Monte 
Carlo (PHMC) algorithm, which we introduced in 0 as an attempt to improve 
the performance of simulation algorithms for dynamical fermions. 

The main idea of the PHMC algorithm relies on dividing the eigenvalue spectrum 
of the Wilson-Dirac operator M on the lattice into different disjoint parts. These 
different parts of the eigenvalue spectrum are then treated by either incorporating 
them in the update step of a simulation algorithm or by taking them into account 
in a reweighing procedure. This general idea is realized in practise by designing 
suitable polynomials that approximate the inverse of M^M, which is needed in the 
actual simulation, with a different accuracy for different parts of the eigenvalue 
spectrum of In the present paper we choose a polynomial approximation 

to the inverse of which is equivalent to basically neglecting the contribution 

of the low-lying modes and taking very precisely into account all the other modes 
in the update step. This choice, which follows the original suggestion in 0, 
is motivated by the experience with the multiboson technique p, 0, |], p, pT[| : 
neglecting in the update a small number of low-lying modes of M still yields 
results very close to the ones obtained using the exact Hybrid Monte Carlo (HMC) 
algorithm. We want to emphasize, however, that our choice is only a special case 
and the general method allows for a greater flexibility including ideas like the one 
proposed in . 

One may argue that the reweighing step can be replaced by a reject/accept step 
in order to render the algorithm exact. We think that this is not the best choice 
for the following reason: it is expected that almost zero modes of the Wilson- 
Dirac operator appear when working in large physical volumes or at large values 
of the lattice spacing. In such a situation a reject/accept step leads to a dilemma: 
either the acceptance probability becomes so small that such events are always 
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rejected. Or, if they are accepted, the zero modes give exceptional values to quark 
propagators, distorting a statistical sample substantially. However, in full QCD, 
gauge conhgurations carrying zero modes may give a hnite contribution to several 
fermion observables, which should be taken into account -at least in principle- 
in order to get correct statistical averages. In fact, we consider this scenario as 
a potential danger for the Hybrid Monte Carlo (HMC) algorithm [^] which is 
commonly used. 

As we will demonstrate below, with a suitable reweighing procedure, this problem 
can be overcome elegantly. Namely, in our way of correcting for the polynomial 
approximation, the reweighing factor becomes proportional to the almost zero 
mode and hence cancels any singularity appearing in quark propagators used to 
construct physical observables. This mechanism reflects in a sense the role of 
the determinant when the full QCD partition function is considered. Of course, 
reweighing techniques are widespread in applications for numerical simulations. 
However, we would like to point out that our implementation of the reweighing 
factor makes its computation very straightforward and reliable in all cases and 
does not give too large an overhead in a simulation. 

In a previous publication we introduced the PHMC algorithm and gave some 
hrst, promising results in practical applications for Wilson fermions. However, 
it is by now well known that when using Wilson fermions for simulations of 
lattice QCD, one has to face large lattice cutoff effects in physical observables. 
For example, the axial Ward identity can be substantially distorted in this case 
[[I^ . However the effects of a non-vanishing lattice spacing can be systematically 
reduced by applying Symanzik’s improvement programme |]^: this turns out to 
be easier in practice if only on-shell quantities are to be 0(a) improved [|^ . 

In fact, implementing the improvement programme non-perturbatively for both 
the action and all the local operators relevant for on-shell observables, one can 
reach a complete cancellation of the cutoff effects that appear linear in the lattice 
spacing 0.0- Since with such an improved theory we expect to be able to work 
at a much larger lattice spacing, a substantial gain in the cost of numerical sim¬ 
ulations can be obtained. Indeed, the non-perturbative on-shell 0(a) improved 
action has by now already been computed also for dynamical fermions |^. Any 
new simulation algorithm should hence have the ability to be applicable to im¬ 
proved fermions. We therefore extend here our tests of the PHMC algorithm to 
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the case of 0(a) improved actions. In the present paper we are going to discnss 
a nnmber of important technical aspects of the PHMC algorithm. Nnmerical 
resnlts for the performance of the algorithm in practise are deferred to a separate 
pnblication |]^ . 


2 The PHMC algorithm 

We introdnce the PHMC algorithm and discnss several aspects concerning onr 
practical implementation of the algorithm. In particnlar, we derive the compn- 
tational cost of the algorithm in terms of matrix times vector operations. 

2.1 Introducing the PHMC algorithm 

We consider Enclidean QCD with n/ = 2 degenerate flavonrs regularized on a 
hypercubic space-time lattice with lattice spacing a and size x T. With the 
lattice spacing set to unity from now on, the points on the lattice have integer 
coordinates {t, xi, X 2 , x^) which are in the range 0<f<T;0<a;j<L. A gauge 
field U^{x) G SU{3) is assigned to the link pointing from the site x to the site 
{x + fi), where fi = 0,1, 2, 3 designates the 4 forward directions in space-time. 
Throughout the paper we will adopt Schrodinger functional boundary conditions 
JQ- The partition function for lattice QCD with nf = 2 


as detailed in pO 


degenerate flavours of quarks is given by 


Z = JVUe-^‘>^^^det{Q^[U]) , 


( 1 ) 


where Sg is the standard Wilson-plaquette action for the pure gauge sector with 
a coupling strength f3 = Q/qq and the bare gauge coupling. The Hermitean 
matrix Q, defining the fermion action, is given by 

Q(^U')xy 0^5[(f T ^ m/(^)])^x,y 


Cm 


}1U 


^^{(1 'yfl)Ufj^{x)6x+fj,,y 'Jfj.)ujj^{x fi)6x-iJ,^y}] , ( 2 ) 


where k is the hopping parameter, related to the bare quark mass mo hy k = 
1/(8 -|- 2mo) and Csw is the 0(a) improvement coefficient |^. The constant cm 
serves to optimize simulation algorithms and Cq = (1 -|- 8k)“^. For all practical 
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simulations we have imposed an even/odd preconditioning and hence used the 


preconditioned matrix Q, whose precise definition can be found in e.g. ^ 


It is the aim of the numerical simulations to compute expectation values of gauge 
invariant operators O 


{O) = z-^ 


I VUe-^^^^^det{Q^[U])0[U] , 


(3) 


using Monte Carlo methods. Note that in eq.(^ the square of the determinant 
appears in order to have a positive definite measure suitable for the numerical 
algorithms employed below. 

In the PHMC algorithm a polynomial Pn^eiQ"^), approximating for all 

eigenvalues A of with A G [e, 1], is introduced such that det(P“^^(Q^)) ~ 
det((5^). Using the trivial identity det((5^) = det((5^Pn,e(Q^))/det(P„^e(Q^)) and 
representing the determinants with the help of auxiliary bosonic fields 0 and rj, 
carrying colour and spin indices, one may exactly rewrite the partition function 
eq.(P as 


Sp 

S, 


j VUV(j)^V(j)Vr]^Vr] W 


rj^rj . 


(4) 


In eq.(^) we have introduced the “correction factor” W = W[r], U]: 

lU = exp {ri\l - [Q^ ■ Pn,e{Q^)]~^)r]} ■ (5) 

Denoting averages evaluated with the effective action Sg + Sp + as (...)p, the 
exact averages denoted as (...) are obtained by reweighing with W 

{O) = {W)-p^{OW)p . ( 6 ) 


As mentioned in the introduction, the advantage of rewriting the partition func¬ 
tion in the form of eq.(^ is that by a suitable choice of the polynomial Pn,e{Q‘^) 
the eigenvalue spectrum of can be smoothly separated into a part to be in¬ 
cluded in the update procedure by simulating the effective action Sg + Sp and a 
rest, taken into account in the correction factor. 

We remark that, in analogy to the case of the multiboson technique |^, the 
PHMC algorithm is also suited to allow for performing simulations with an odd 
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number of flavours. Of course, the above procedure leading to eq.(§) may be 
generalized and several polynomials may be introduced in such a way that each 
of them gives a good approximation in different parts of the eigenvalue spectrum. 
We demand in this case that the product of all these polynomials approximates 
the inverse of Q^. The realization we are using in this paper amounts to cutting 
out the very low-lying end of the eigenvalue spectrum from the update step. 

In principle, there is a great flexibility in choosing the polynomial to approximate 
Q~^. In this work we follow ref.|^ and choose a Chebyshev approximation method 
to construct Pn,e{Q‘^)- Since the polynomial we are going to use is detailed already 
in [^, ^ we will give here just its final form written in the product representation. 


2n 


PnAQ^) = Pn,e{Q) = 11 [V^iQ “ ^k)] , 

k=l 

where the complex numbers are given by 


(7) 


Tk = \/% = pfc + Wfc , > 0 , k = l,...,n 

rk = ^L+i-fc , fc = n + l,...,2n 

Zk = -(l + e)--(l + e)cos(^^^)-zV7sm(-^) . (8) 

The overall normalization constant, 0^=1 Cfc, can be computed analytically. If the 
Cfc’s are taken all identical, they turn out to be of 0(1). 

The polynomial Pn,t{Q‘^) approximates the inverse of with a relative fit error 
which is bounded from above by 


5 = 2 



(9) 


for all the eigenmodes of with eigenvalues A in the interval A G [e, 1]. The 
operator is normalised (through the choice of cm) in such a way that its largest 
eigenvalue is always smaller than 1. For eigenvalues A < e the relative fit error 
quickly increases as A decreases. 


2.2 Implementation and cost of the PHMC algorithm 

The approximation of det(Q^) through the inverse determinant of the polynomial 
Pn,t{Q^) was first suggested in p. There it led to a completely local bosonic action 
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involving n copies of bosonic fields. Since the bosonic action in that case was local, 
algorithms like heatbath and over-relaxation could be used. One unfortunate 
property of this approach was the observation that the autocorrelation time of 
the algorithm grows with the number of bosonic helds appearing in the action 
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The approximate fermion action Sp eq. i) ii on the other hand still represents 
a non-local bosonic action. In this approach one therefore has to rely on small 
step size algorithms. However, the advantage is that now only one dynamical 
bosonic held is needed and hence the dangerous increase of the autocorrelation 
time with the number of bosonic held copies mentioned above is avoided. 

In more detail, we have chosen to use a suitably adapted ^-version of the 
HMC algorithm for the update of the gauge helds. The usual arguments, includ¬ 
ing the reversibility of the molecular dynamics evolution, leading to the proof of 
detailed balance, still holds for the case of the PHMC algorithm. The implemen¬ 
tation of this update method for the case of 0(a) improved fermions and even/odd 
preconditioning can be done in complete analogy to ref. We therefore only 


want to point out some peculiarities which are not discussed in |23 


In the following discussion we will be somewhat sketchy and focus our attention 
on the modihcations of the standard ^-version of the HMC algorithm that are 
needed for implementing the PHMC algorithm. In particular, we note again that 
it is to be understood that for the actual simulation the preconditioned matrix 
Q was always used. Another remark is that the roots r^, k = 1,... ,2n were 
suitably reordered with respect to their dehnition, eq.(D, while preserving the 
relation r 2 n+i-k = ^k- Such a reordering is necessary to keep rounding errors 
on a tolerable level, as thoroughly discussed in [Q. Details of the different 
ordering schemes we have used in our implementation of the PHMC algorithm 
and rounding errors associated with them are discussed in Section 4. 

In the PHMC algorithm the variation of the pseudofermion action, Sp eq.(^, 
with respect to a given gauge link is somewhat more complicated than in the 
standard HMC algorithm. In terms of the variation of the operator Q, denoted 
by 5Q, it assumes the form 


5Sp = 


[6Q Xj-l ® Xln-j + X2n-j ® x] 


i=i 



( 10 ) 
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where the auxiliary pseudofermion helds Xji J = 1) ■ ■ ■ > 2n — 1 are defined as 
Xj = - Tj)] ■ [yc^(Q - r^-i)] ■ ... ■ - ri)]0 (11) 


and 0 denotes the pseudofermion field of eq.(^). In eq.(p!0D the products x ® 
denote direct products in colour space and a trace over spin indices is understood. 
In order to speed up the simulation and minimize memory requirements we pro¬ 
ceed for the computation of 5Sp as follows: We first precalculate the n vectors 
Xki for all /c = 1,..., n and store them. We then start the evaluation of the differ¬ 
ent contributions to 5Sp by computing Xn-i ® xli and its Hermitean conjugate, 
which for brevity will not be mentioned explicitly in the following. The next 
contribution io 5Sp would involve Xn-2 ® xl+i- The vector xL+i is obtained by 
computing {Q — r„+i)xn- The resulting vector can now be stored in Xn-i since 
this vector is no longer used. Iterating this procedure results in a memory re¬ 
quirement of u -|- 1 pseudofermion vectors. This may be considered as a drawback 
of the PHMC algorithm as it requires a substantial amount of memory if the 
degree of the polynomial becomes large. However, as it will be discussed below, 
there are several ways to overcome possible bottlenecks if not enough memory is 
available. 

It is clear that the evaluation of all terms necessary to evaluate dSp amounts 
to (2n — 1) Q(/) operations (the extra work to incorporate the roots in the 
operator Q — is completely negligible). In addition, since there are n terms 
to be summed (and traced) to evaluate eq. ([T0|) and since each of them requires 
a computational work roughly equivalent (at least in our implementation on the 
APE computers) to one Qcj) operation, the complete cost of the computation of 
6Sp will become about 3n Q(j) operations. 

Although the polynomial approximation to is rather precise even if a few 

eigenvalues of occur that are slightly larger than one, the numerical construc¬ 
tion of ^S'p itself turns out to be unstable when eigenvalues very close to 1 are 
met in the updating procedure. At least in our implementation of 6Sp, based on 
eq. ( 0 ), numerical overflows occurred when updating gauge configurations car¬ 
rying modes of with eigenvalues very close to (even if smaller than) 1. In 
practise, Q should therefore be normalized, through cm in eq.(^, such that the 
average highest eigenvalue of is sufficiently smaller than one, say (Amax) ~ 0.9. 
Since the value of the highest eigenvalue of shows very small fluctuations, such 




an appropriate normalization can safely be done at the beginning of a simnlation. 
The psendofermion held 0 in eq.(|[) is to be generated according to the distribntion 
exp {—S'p[17, 0]}. Generating this distribntion via a heatbath step involves the 
compntation of the inverse sqnare root of Pn,e{Q‘^\U]) . This can be achieved by 
compnting 0 throngh 

<l> = Al,(Q)[Q^PnAQ'‘T"Q'‘Ra (12) 

where Rq is a random Ganssian vector and An^e is given by 

n 

An,e{Q) = n V^iQ - ■ ( 13 ) 

k=l 

The vector X = [Q‘^Pn,e{Q‘^)]~^Q‘^RG is compnted with a Gonjngate Gradient 
(GG) method, solving the eqnation Q'^Pn,e{Q^) X = Q^Rg- We demanded that 
in generating 0 with a GG inverter the relation 

-0(10-^) (14) 

holds. We noticed that this can be achieved by choosing a moderately large 
stopping criterion for the GG solver, namely Cstop = 10“^^, where Cstop is dehned 
by the norm of the residnal vector 

^res = Q‘^Rg - Q^PnAQ") ^ (15) 

divided by the norm of the solntion vector X (which is nnmerically close to the 
norm of Q'^Pn,e{Q^) X in all practical cases): 

6stop=||<h,esilVl|Wf. (16) 

A last remark concerns the second psendofermion held r]. It is generated trivially 
by Ganssian random vectors. Throngh it the correction factor W = W[ri,U] 
(eq.(|^)) can be compnted via the solntion of the eqnation [Q^Pn^^{Q^)]X = r], 
which involves an additional inversion of Q^Pn,e{Q‘^)- The correction factor, or 
w = log(iy) is then obtained by 

wlri, U]^r,\l- lQ"P„,,(Q")]-')r,. (17) 

Since the expression Q‘^Pn,eiQ‘^) is almost the nnit matrix, there is the possibility 
of dangerons ronnding errors when compnting the vector (1 — [Q^Pn,e{Q^)]~^)v 
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in eq. ([l7|) , especially on machines with only 32-bit precision []. However, eq. (p!7D 
may be rewritten as 

w[7],U] = r]\RnAQ^)[Q^PnAQ^)]~")v ■ (18) 

Following refs. the polynomial Rn,e{Q^) = Pn,e{Q‘^) — ^ is directly given 

by Chebyshev polynomials of degree n-fl. One may hence use numerically stable 
recursion relations to compute Rn,e{Q‘^)- Although the use of eq. (^), instead 
of eq. (O), leads to a somewhat larger cost for evaluating the correction factor, 
our experience is that it is advisable to use eq. ([T8|) when only 32-bit precision 
is employed. Analogously to the case of generating the pseudofermion field 0, 
eq.(^, we optimized the value of the stopping criterion also for the CG inversion 
needed in eq. ([T8|) . 

It might be observed^] that eq.fpAj) can be generalized to 

W[r], U] = - [hn,eQ‘^Pn,e{Q‘^)\~^)ri , (19) 

where bn,e is some real positive constant. Its value might be optimized, depending 
on the values of n and e, in order to reduce the stochastic noise associated with 
reweighing through the correction factor. However, we did not exploit this addi¬ 
tional freedom and took always = 1, which enabled us to use the expression 
of w[ri, U] in eq. (|I8|) . 

In principle, the ratio of the number of r^-held “updates” to the number of gauge 
field updates is arbitrary. In fact, it turns out (see Section 3) that it is advan¬ 
tageous to choose this ratio to be larger than one. In this way, the additional 
noise induced in the reweighted observables, eq.(|[), by the correction factor can 
be partly suppressed. The above-mentioned ratio will be denoted in the following 
by Ncorri since it gives the number of computations of the correction factor per 
gauge field configuration. 

From the discussion above it is easy to express the cost of the PHMC algorithm in 
terms of matrix times vector, Qcf), operations. The cost for the PHMC algorithm 
can be split into three parts, 

Cq^{PHMC) 

Gbhb T Gupdate T Ccorr j (20) 

^We remark that, of course, all internal products and global sums were performed in software 
Kahan or double precision arithmetic. 

^We are grateful to Ulli Wolff for this interesting remark. 
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where Cbhb is the cost for the heatbath of the bosonic helds, ^update the cost for 
the computation oi 6Sp and Ccorr the cost to evaluate the correction factor. In 
units of Q(j) operations we hud 

{2n + 2) ■ + n 

3tI ■ Nstep 

{2n + 2) ■ ■ Ncorr . (21) 

The factor Ncorr denotes as above the number of evaluations of the correction 
factor W per full gauge held update (or molecular dynamics trajectory). The 
symbols and Nqq denote the average numbers of CG iterations in the 

heatbath of the bosonic helds and the computation of W, respectively. The factor 
3n in Gupdate comes from adding the cost for the construction of the auxiliary helds 
Xk and the cost of the other algebraic operations needed for a single update of 
the gauge held and its conjugate momenta. iVstep is the number of steps used 
in a trajectory, i.e. how often 6Sp has to be evaluated within a trajectory. We 
explicitly verihed that our formulae for Gupdate, Gbhb and Gcorr agree with the 
costs in real time observed for our implementation of the PHMC algorithm on 
the APE computer. 

The scaling behaviour of the computational cost Gupdate, eq.([^), as a function 
of the lattice size, x T, or the condition number of is expected to be 
fully analogous to the one observed in the HMC algorithm, with one important 
diherence. Due to the form of the variation of the pseudofermion action, Sp 
eq.(^, in the molecular dynamics evolution for the PHMC algorithm the role of 
the lowest eigenvalue of is taken over by the infrared cut-off parameter of 
the polynomial approximation, e, as already discussed in |^. Since in practise 
e ~ 2(Amin), we expect therefore an improvement on the cost of a simulation. 


Gbhb 

Gupdate 

Gf-orr 


3 Effects of the correction factor 

In this section we want to discuss the effects of the correction factor we introduced 
for the exactness of the algorithm. The hrst main point concerns the statistical 
fluctuations induced by reweighing observables with the correction factor: this 
aspect determines to a large extent the tuning of the PHMC algorithm. The 
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second point is of qnalitative nature and concerns the occurrence of gauge con¬ 
figurations with exceptional eigenvalues of and how the reweighing procedure 
can deal with them. 

3.1 Statistical errors and reweighing 

As discussed above, an important ingredient of the PHMC algorithm is the cor¬ 
rection factor. The computational cost of the algorithm will depend in a crucial 
way on the behaviour of the correction factor in a real simulation. The reason 
is that through the correction factor, which is computed stochastically, a cer¬ 
tain noise is introduced which may affect the errors on the observables and will 
contribute therefore to the cost of a simulation. 

In the PHMC algorithm the update of the gauge field U is alternated with Neon 
“updates” of the pseudofermion field rj, yielding Neon evaluations of W[ri, U] on 
each gauge conhguration. Performing a simple arithmetic average of them yields 
a single estimate of the correction factor per each gauge configuration. As a 
consequence, on a sample of N gauge configurations labelled by the integer j, the 
averages ■ ■)p introduced in section 2.1 can be represented as trivial arithmetic 
averages over the sample: 

N 

{OW)p = N-^ Y. ( 22 ) 

i=i 

where Oj is any gauge invariant observable and Wj the above alluded estimate 
of the correction factor on the gauge configuration Uj. 

For any finite number of configurations, the statistical error on (O) is expected 
to depend on the choice of n and e, i.e on the chosen polynomial approximation 
to and, for a given polynomial approximation, also on the value of Neon- 

By rewriting the expression for the reweighted average of O, eq.(^), in the form: 

{O) = {0)p + {W)-p^ ■ {{OW)p - {0)p{W)p) , ( 23 ) 

it becomes clear that both the statistical fluctuations of O and those of the con¬ 
nected part of OW contribute to the statistical error on the reweighted average, 
eq.(^). The latter contribution will depend on the statistical correlation between 
the observable O and the reweighing factor W . 
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As the polynomial approximation to is made more precise, the contribn- 

tion to the error on O coming from the statistical correlation between O and W 
becomes smaller, jnst becanse W gets closer to 1. Still, in the limit Ph ~ 1, there 
remains some noise, becanse W is compnted not exactly bnt only stochastically. 
We are then left with a pnre Gaussian noise factor. 

Choosing a poor polynomial approximation to for the update step in 

the PHMC algorithm yields a reweighing factor W which will strongly fluctuate. 
(Think, e.g., of W as being the full determinant.) Moreover, W may have in 
general a non-negligible correlation with the observable O, such that, even in the 
limit Worr —>■ oo, a large contribution to the error on (O) is expected to arise. 
The discussion suggests that it is the relative statistical error of the correction 
factor itself that controls the additional fluctuations induced by reweighing and 
hence the statistical error for a given observable (O). 

As a consequence, we can expect the PHMC algorithm to be found efficient only 
in situations where the variance of W is very small. This amounts to choosing 
the value of e to be of the same order as the average lowest eigenvalue of (Q^) and 
the value of n to be large enough for the polynomial approximation to 
to be reasonably precise. As we will see below, situations of this kind can be 
realized, in practice, by setting e ~ 2(Amin) and n such that the £t accuracy 
5 ~ 0.01, see eq.(^. When this criterion is respected and hence the parameters 
of the polynomial are hxed, the statistical error on {O) will only be a function of 
Worr- ffi fhe following we will see which values of Worr are sufficient to keep the 
error on {O) small. 

A most important quantity in determining the cost of a simulation of a given 
algorithm is the autocorrelation time. Since we are using the correction factor 
to render the algorithm exact, all observables have to be computed as a ratio 
of {OW)p and {W)p\ hence it is not obvious how to define the autocorrelation 
function of (P, in terms of which the integrated autocorrelation time rint(C^) is 
usually dehned. We “dehne” the integrated autocorrelation time for a given 
observable O by means of the expression which can be derived in the ordinary 
case, when no reweighing occurs: 



(24) 
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where denotes the naive and o'{0) the true error on the observable O. 

In order to obtain a reliable estimate of the true error on O in all of our tests, 
which are discussed below, we have used a single elimination jack-knife procedure. 
The jack-knife procedure has been then combined with a binning analysis by 
blocking the data into blocks of length Lbiock- Our error analysis follows closely 
the discussion in m (see section 5.2 there). 

We have run K replica in parallel and determined the true error in two ways: In 
the first approach we average on each replicum separately. Since the averaged 
data are statistically independent, we can estimate the true error by looking 
at the naive dispersion of them with respect to their arithmetic average. The 
relative error on the error in this case can be estimated as In the 

second approach, we divide the sample into blocks of size Tbiock, so that the total 
number of blocks is given by Wiock = KNtraj/Luock, where Wraj is the number 
of trajectories obtained per replicum. Of course, Tbiock is to be constrained by 
the requirement that data coming from different replica never appear in the same 
block. The error can then be computed as a function of the block length Lbiock- 
For a large enough block length, a plateau behaviour sets in from which we then 
determine the true error. The relative error on the error in this procedure can be 
estimated as (2Wiock)~^^^- 

Even when we have determined the true error on an observable as discussed above, 
the dehnition of the naive error on (O), and consequently the autocorrelation time 
Tnt(Ci), is not obvious, again due to the occurrence of the reweighing factor W in 
the dehnition of (O). A possible dehnition of the naive error on (O) is given by 
the single elimination jack-knife error for a block length of Tbiock = 1- We remark 
however that the variance of O: 

{O^W)p 


v{o) = {o^) - {oy = 


ipwfp 


(26) 


(MOf mi 

is an observable itself and should hence be independent of a particular algorithm 
used to compute it. The above observation suggests another dehnition of the naive 
error on (O), which is the one adopted in our studies of the PHMC algorithm: 

^naive((^) _ _ 1)-1 V(C))] . (26) 

with N = KNtray Note that only for W = 1 both dehnitions of the naive error 
have to agree. 
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Figure 1: The integrated autocorrelation time for the plaquette, 2rint(-P), versus 
(Tp{W)/{W)p, for several values of e: e = 0.046 (empty hexagones), e = 0.036 
(hlled hexagones), e = 0.026 (empty triangles), e = 0.016 (hlled triangles). For 
each value of e four values of n (8, 12, 16, 20) are considered: the smaller is n, 
the larger is the corresponding value of ap{W)/{W)p in the plot. 


3.2 Tuning of the PHMC algorithm 


In order to investigate on a quantitative level the tuning problem for n, e and Ncorr, 
we have run the PHMC algorithm on a 4^ lattice with Schrodinger functional 
boundary conditions ^ for a number of choices of n and e. To be more 
specihc, we have set Ct^go) = 1 and 6 = n/b. At the boundary at time t = 0 the 
gauge helds were set to classical helds denoted as point “A” in |2^. Finally, the 
gauge helds at time t = T were set to be identical to the one at t = 0. We remark 
that since we have chosen the gauge helds to be identical at both time boundaries, 
we do not have exactly the same boundary conditions as the ones in [^. The 
simulation parameters were chosen to be Csw = 0, /3 = 6.4 and k = 0.15. Although 
in this situation the average condition number for was only about 60 we still 
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Figure 2: The cost of obtaining a statistically independent measurement of the 
plaquette, 2rint(-P) ■ Cq^ is plotted versus ap{W)/ (hF)p, using the same symbols 

as in Fig.|. 


hnd a signihcant dependence of the simulation cost on the algorithm parameters 
such that a sensible tuning can be performed. For each choice of n and e about 
20000 trajectories were generated using a step size of 5t = 0.25 and the number 
of molecular dynamics steps = 4. These parameters were chosen to yield 
acceptance rates of about 80%. The values of Neon were varied from 1 to 4 or 1 to 
10 in these simulations. These values for Noon turned out to be sufficient to look 
for an optimal value minimizing the computational cost. We will consider here 
mainly two observables, the plaquette and the lowest eigenvalue of denoted 
by P and Amin, respectively. We mention that we monitored also the largest 
eigenvalue of and the reweighing factor itself. Within statistical errors the 
average values for all the considered observables agree among all our simulations 
with the PHMC algorithm and with the corresponding results obtained from the 
HMC algorithm. 

We start by showing the integrated autocorrelation time of the plaquette observ- 
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able in Fig. |I|. As discussed above we expect the cost of a simulation using the 
PHMC algorithm to depend strongly on the relative fluctuation of the correction 
factor. We therefore plot the integrated autocorrelation time as a function of 
(Tp{W)/{W)p. Here the subscript P (not to be confused with the symbol for 
the plaquette) reminds that the mean value and the standard deviation for the 
correction factor W do not involve, of course, any reweighing and 

1/2 


ap{W) = {Nuock - l)-\{W^)p - {Wyp) 


(27) 


For the hgure we have chosen four values of e (0.046, 0.036, 0.026, 0.016) and 
n (8, 12, 16, 20). The smaller n is, the larger is the corresponding value of 
ap{W)/{W)p in the plot. For each choice of n and e, we took the value of Worr 
that turns out to minimize the quantity 2rint(P)CQ# (see below). The point at 
ap{W) / {W )p = 0 belongs to the integrated autocorrelation time as obtained from 
the HMC algorithm. It is clearly seen that when increasing 0 'p(fF)/(fF)p the 
integrated autocorrelation time assumes large value. For crp{W)/(W)p < 0.01, 
the dependence of the autocorrelation time becomes weak and no preferred choice 
of n and e can be given. 

In Fig. ^ we show the total cost by computing 2rint(P)C(3$, with the cost factor 
given in eq. (|20|) , again taking for each n and e the value of Worr that mini¬ 
mizes 2rint(P)C'Q$ itself. Here we find that for crp{W)/(W)p ~ 0.01 one reaches 
the minimal cost of the algorithm. We also give, at (Tp{W)/{W)p = 0, the cost 
of a corresponding HMC simulation. As already mentioned in the cost ob¬ 
tained from the PHMC algorithm is significantly lower. On the other hand, for 
upiW)/{W)p > 0.01 the cost from the PHMC algorithm increases, which is a 
direct consequence of the increase of the autocorrelation time observed in Fig. [^. 
For up(W)/(W)p -C 0.01 we also find an increase of the cost of the simulation. 
This is a consequence of the fact that more precise polynomial approximations 
have a higher computational cost without giving any sensible reduction of the 
autocorrelation time rint(P) and, correspondingly, of the statistical error a(P). 
The corresponding results for Amin look qualitatively similar, although with a 
somewhat stronger dependence on ap{W)/(W)p. Also in this case we find the 
optimal value of up{W)/{W)p k. 0.01. 

After having identified the optimal values for n and e, it is interesting to study the 
behaviour of the statistical errors as a function of the values of Ncow To this end. 
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L^xT 

Algor. (Worr) 

<P> 

< Amin(Q^) > 

44 

HMC 

0.66179(5) [12] 

0.01582(3) [8] 

44 

PHMC(4) 

0.66186(5) [12] 

0.01582(3) [8] 


PHMC(3) 

0.66185(5) [12] 

0.01583(3) [8] 


PHMC(2) 

0.66185(5) [12] 

0.01583(3) [8] 


PHMC(l) 

0.66185(5) [12] 

0.01583(3) [8] 


PHMC(O) 

0.66221(5) [12] 

0.01451(3) [8] 

44 

PHMC(IO) 

0.66188(5) [18] 

0.01588(3) [16] 


PHMC(9) 

0.66188(5) [18] 

0.01584(3) [16] 


PHMC(7) 

0.66198(5) [19] 

0.01586(3) [17] 


PHMC(5) 

0.66198(5) [20] 

0.01581(3)[17] 


PHMC(4) 

0.66201(5) [22] 

0.01584(03) [18] 


PHMC(3) 

0.66213(5) [23] 

0.01575(3) [19] 


PHMC(2) 

0.66215(5) [28] 

0.01569(3) [22] 


PHMC(l) 

0.66218(5) [36] 

0.01553(3) [24] 


PHMC(O) 

0.66272(5) [16] 

0.01218(3)[8] 


Table 1: The behaviour of mean values and statistical errors for the plaquette 
and the lowest eigenvalue of as a function of Ncorr- data are obtained with 
the HMC and the PHMC algorithms. For the latter we have considered the 
parameters n = S, e = 0.036 (data set with A^corr ranging from 1 to 10) and 
n = 16, (data set with iVcorr ranging from 1 to 4). The statistics has been 21000 
trajectories in all cases. We give in round brackets the naive error and in square 
brackets our estimate for the true error. An arrow points towards the line where 
the value of Worr turns out to be basically optimal. The case Worr = 0 in the 
PHMC data refers to the results obtained with no reweighing. 

we have chosen two different values of n and e. The hrst one, n = 16 and e = 0.026 
corresponds to (Tp{W)/{W)p ~ 0.01 and is therefore considered to be close to 
the optimal value. The other choice is u = 8 and e = 0.036, which gives a value 
of upiW)/(W)p ~ 0.032 and is clearly far from being optimal. The mean values 
for the plaquette P and eigenvalue Amin as well as the naive (round bracket) 
and true (square bracket) errors are given in table |^. The value of Worr = 0 
corresponds to the case where no reweighing is performed, which is expected to 
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yield systematically wrong results. For the non-optimal choice of n and e we 
observe a strong dependence of the statistical errors on iVcorr- The lowest values 
of a are obtained only for Ncorr = 10, i.e. the largest of the considered values of 
Ncorr- Even the values of a corresponding to Neon = 10 are still somewhat larger 
(especially for Amin) than both the statistical errors for iVcorr = 0 and from the 
HMC algorithm. This behaviour closely corresponds to what is expected from the 
above discussion about the statistical noise induced by the reweighing procedure. 
On the other hand, when considering the optimal choice of n and e, the statistical 
errors from the PHMC algorithm do not show any visible dependence on Noon and 
basically coincide with the ones from the HMC algorithm. Finally we remark that 
in all cases the mean values are consistent among themselves within the measured 
statistical errors. Moreover, the naive errors, dehned according to eg. (]26|) , are 
also consistent among all cases considered here. 

The behaviour of the error on the plaquette and Amin was also tested on an 8^ 
lattice for parameter values f3 = 5.6, k = 0.1585 ~ Kc, Cs„ = 0. The Schrodinger 
functional boundary conditions that we adopted were chosen to be the same as 
for the lattice mentioned above. The only difference is that the boundary 
improvement coefficient was set to its 1-loop value, i.e. Ct{go) = 1-0 — 0.089(7o- 
The statistics in this case is 2700 trajectories. We refer to ^ for a more detailed 
information about the algorithmic parameters and give our results in table |^. 
We compare the results obtained using the PHMC algorithm (in the setup with 
K = 32 replica) with the ones obtained using the HMC algorithm. We performed 
also a control run for the PHMC algorithm on only 1 replicum running up to the 
same statistics of 2700 trajectories. This gave completely consistent results, as 
it should, of course, and provided us with a further check of our estimate of the 
uncertainty on the true error given in braces in table |^. From table ^ we infer 
that in this case the practically optimal value of Worr appears to be 2 or 3 and is 
hence again reasonably small when a good polynomial approximation is chosen. 
The results on the 8^ lattice were obtained by taking n = 48 and e = 0.0026, 
yielding a relative £t error 6 ~ 0.013. This value of 6 is even slightly larger 
than the relative £t error corresponding to the optimal choice of n and e on 
the lattice 4^, when the condition number of was about 10 times smaller. 
This seems to indicate that, even if the statistical fluctuations of the correction 
factor are expected to increase with the lattice volume and the condition number 
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xT Algor. (A^corr) 

<P> 

< Amin(Q^) > 

8^ HMC 

0.57251(04)[12]{3} 

0.001310(10)[51]{8} 

8^ PHMC(4) 

0.57253(5) [14]{3} 

0.001318(10)[50]{8} 

PHMC(3) 

0.57248(5) [14]{3} 

0.001318(10)[50]{8} 

^ PHMC(2) 

0.57249(5) [15]{3} 

0.001328(10)[50]{8} 

PHMC(l) 

0.57260(5)[19]{5} 

0.001310(10)[60]{10} 

PHMC(O) 

0.57272(5) [12]{2} 

0.001141(10)[45]{7} 


Table 2: The behaviour of mean values and statistical errors for the plaquette 
and the lowest eigenvalue of as a function of A^"corr on a 8^ lattice. The 
notation is the same as in table ^ The numbers in braces give our estimate of 
the uncertainty on the true error. 

of it might be unnecessary to take polynomial approximations more and 
more severe. This might be explained by the observation that in this case also 
autocorrelation times generally increase, leading to a larger number of evaluations 
of the correction factor on statistically correlated gauge configurations; in some 
cases, moreover, the statistical fluctuations of physical observables increase, too. 
However we think that further and much more time-consuming studies are needed 
to clarify the issue. 

3.3 Exceptional eigenvalues 

So far, we have discussed the PHMC algorithm for situations where no excep¬ 
tional eigenvalues of Q^, i.e. those that are orders of magnitude smaller than the 
average lowest eigenvalue, appear. However, the PHMC algorithm is designed to 
allow especially for the occurrence of gauge field configurations carrying excep¬ 
tionally small eigenvalues of Q^. In fact we expect the probability of generating 
such conhgurations with the PHMC algorithm to be considerably larger than the 
corresponding probability when using the HMC algorithm or exact versions of 
the multiboson technique with accept/reject step. 

This expectation is indeed confirmed in a real simulation, as can be seen from 
Fig. 1^. There we plot the distribution of the lowest eigenvalue of as obtained 
from simulations with the HMC and the PHMC algorithms. The parameters for 
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the runs were chosen to be /? = 5.4, n = 0.1379 and Cs* = 1.7275. The lattice 
size was 8^ x 16 and Schrodinger functional boundary conditions were adopted 


as specihed in [18 


Clearly, the distribution obtained from the PHMC algorithm stretches to much 
smaller values of A. However, after reweighing with the correction factor, the 
average lowest eigenvalue obtained in this case from the PHMC algorithm takes 
a value consistent with the one obtained from the HMC algorithm. In Fig. ^ 
we show the (Monte Carlo) time evolution of the 10 lowest eigenvalues. We 
find that there is a band of eigenvalues at a level roughly corresponding to the 
average lowest eigenvalue and that only occasionally an isolated eigenvalue gets 
very smallQ. This is exactly the situation anticipated in ref. Q] . As also discussed 
there, if Amin -C 1, when computing the correction factor exactly on each gauge 
conhguration, i.e. taking Ncorr = oo, W = det[Q^Pn,e{,Q^)] turns out to be 
proportional to Amin- Hence the correction factor serves the purpose of cancelling 
divergences in certain quark Green functions. In the following discussion we 
neglect the distinction between the operator Q^, which is certainly the relevant 
one for quark Green functions, and some preconditioned form of it, which may 
be conveniently used in the update and reweighing procedures. Indeed, doing so 
does not affect our conclusions and keeps the discussion more general. 

In practise the evaluation of the correction factor on gauge configurations carrying 
exceptionally small eigenvalues may be problematic, since the badly conditioned 
operator Q^Pn,e{Q^) has to be inverted and Worr is usually taken to be a finite 
(relatively small) number. 

We see from eq.(^) that the quantity [Q‘^Pn,e{Q‘^)]~^V is needed for the evalu¬ 
ation of the reweighing factor as described above. The inversion of Q‘^Pn,t{Q‘^) 
is performed by using a CG algorithm, where suitable vectors are multiplied by 
Q^Pn,t{Q^) several times. As discussed in |Q, the multiplication of Q^Pn,e{Q^) is 
affected by rounding-error effects, which can be kept on a tolerable level in nor¬ 
mal situations. However, on gauge conhgurations carrying exceptionally small 
eigenvalues of Q^, these rounding-error effects might be significantly amplified, 
especially for the components of [Q^Pn^e{Q^)]~^V having non-vanishing projection 
on the low lying mode eigenvectors. 


In some rare cases we have observed that the same happens for two or three eigenvalues. 
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Figure 3: The distributions of the lowest eigenvalue Amin of as obtained from 
the HMC and the PHMC algorithms. The quantity P(Amin) denotes the number 
of eigenvalues for a given bin, normalized by the total number of eigenvalues. 
Both runs have the same statistics. 
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Figure 4: The Monte Carlo time evolution of the five lowest eigenvalues from 
a simulation using the PHMC algorithm. We denote by the stars the lowest 
eigenvalue and for the open circles the remaining ones. 
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In order to see the potential problems arising from taking a finite value of Ncorr, 
let us introduce the eigenvalues Xj and eigenvectors \Xj) of Q^, 

Q^\X,) = X,\X,) . (28) 


Then the correction factor W = exp(t(;[r 7 , U]) of eq.(|^ becomes 


w[v,u] = + ■ (29) 

3 

Since i?n,e(A) —> —1 as A —> 0, all random helds 7] that have a sizeable projection 
on the low-lying modes of yield very large negative values of w and hence 
exponentially small values of W are obtained. The dominant contributions to the 
correction factor will come when the projection of the helds rj on the exceptional 
modes is almost zero. In practise the correction factor is evaluated stochastically 
by setting Neon to a small value and taking for the correction factor on a given 
gauge conhguration U the following noisy estimate: 

-^corr 

W[U; iVc„„] = iV-;, Wlv„ U] . (30) 

It is, of course, very unlikely that a held r] with almost vanishing projection on 
the zero mode will be generated. As a consequence, the noisy estimate of the 
reweighing factor obtained on a gauge conhguration carrying low-lying modes of 
is likely to be very imprecise. On the other hand, an exact evaluation of the 
reweighing factor of eq.(^) with Neon = oo will give that W oc Amin as desired. 
The above discussion makes it clear that the computation of the correction factor, 
as explained above for “normal” situations, should be generalized to deal with 
the case where exceptional eigenvalues occur. To this end we introduce another 
infrared cut-oh parameter e -C e and write the partition function as 


Sp 

S, 


j VUVAV(j)VAVri WbWjp e-^^^Sp+Sr,) 

Sp[U,cl3]=APnAQ"[U])cl3 

rjA 


(31) 


where now the original correction factor W is split into two parts, an “infrared” 
part, 

^m= n[l + ^n..(A,)] (32) 

Xj<e 
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and a “bulk” part 


WbVi, U] = exp ■ (Q^ . , (33) 

where 

\V±) = Ih)Aj)|Aj)(Aj|r7) . (34) 

3 

The infrared part of the correction factor Wm is very much in the spirit of ref. ||^ , 
where also the underlying assumption was taken that only a few isolated small 
eigenvalues occur. Exact observables are now computed through 

{O) = {WbWir)p\OWbWir)p . (35) 


In order to guarantee the exactness of the simulation algorithm, e has to be 
fixed in a given simulation. We give in appendix A a derivation of eq. (^) which 
explicitly shows how the splitting of the original correction factor W into Wr and 
WiR is fully determined a priori by the choice of e. Of course, when no eigenvalues 
smaller than e occur, Wjr = 1 and it has not to be computed. In the case that 
such eigenvalues occur, the two correction factors Wr and Wjr can be computed 
by evaluating all eigenvalues Xj < e and the corresponding eigenvectors. 
Obviously, Wr in eq.(|3^ receives no contribution from the low-lying modes {Xj < 
e) of Q^. This property of Wr justihes the expectation that a noisy reweighing 


with Wr in eq. (|35|) can be performed by choosing a small, finite value for the 
number of //-held “updates” Nr. From this point of view, the statistical noise 
induced by reweighing with Wr is expected to be quantitatively very similar to 
the one induced in “normal” situations by the reweighing factor W, eq.(^ and 
the values for Nr should in practise be similar to the values usually chosen for 
N 

’ r.nrr • 


We have tested the modified correction factor in practise by taking a gauge config¬ 
uration carrying a mode of with an exceptionally low eigenvalue (about 3 ■ 10® 
times smaller than the highest one). Indeed, the estimate of dei[Q‘^ ob¬ 
tained from the original reweighing factor W is very imprecise and converges very 
slowly to the correct value when increasing Worr- As a consequence one hnds a 
large variance of IT as a function of r]. On the other hand, the improved estimate 
of W[U] given by Wir[U]Wb[U-, Nr], where 

Ng 

Wb[U-, Nr] = iV-1 WrIpj, U] (36) 
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and e = e/10, is much less noisy already for pretty small values of Nb- In fact, 
the fluctuations are fully analogous to the case when no exceptional eigenval¬ 
ues are present. More quantitative information on this point will be given in a 


forthcoming publication |T9 


We remark that the problem of inverting the operator Q'^Pn,e{Q‘^) in the sub¬ 
space orthogonal to the one spanned by the low-lying modes of is always well 
conditioned, even in the presence of an exact zero mode. The evaluation of 
can be done by computing all the eigenvectors of corresponding to eigenvalues 
smaller than e. Since, as shown in Fig.l, there are usually only a few isolated 
eigenvalues below e, these eigenvalues and the corresponding eigenvectors can be 
calculated reliably by using the techniques described in . 

The level of precision needed in computing the low-lying eigenvalues and the cor¬ 
responding eigenvectors of is determined by requiring that the uncertainties in 
WiR and Wb, induced by the uncertainties on these eigenvalues and eigenvectors, 
be negligible (with respect to the statistical fluctuations of Wir and Wb)- Using 
the fact that ~ —1 -|- cA, with c = 0(e“^), the relative uncertainty on 

each factor in the product of eq. (|3^) can be estimated (for A 1) as 


[l + i?„,,(A)]-'5[l + i?„,,(A)]~A-ia, 


(37) 


where 5\ denotes the uncertainty on a given eigenvalue A. On the other hand, the 
uncertainties in the determination of the eigenvectors corresponding to eigenval¬ 
ues smaller than e directly affect the evaluation of \ri±), eq. (|3^) , and hence Wb, 

eq.®). 

The discussion above makes very straightforward the software implementation of 
the proposed reweighing procedure: once a value for e has been set, the program 
has just to check on each generated gauge configuration whether the lowest eigen¬ 
value of is smaller than e. Only in the affirmative case, of course, does the 
correction factor WbWjr actually differ from the usual one and an evaluation of 
WjR is required. 

It might be observed that, if the PHMC update ever generates gauge conhgura- 
tions with a significant fraction of the modes of belonging to very low eigen¬ 
values, the above described reweighing procedure becomes computationally very 
expensive. This is certainly true, but in that case troubles are also expected in the 
evaluation of quark propagators by ordinary CG-like inverters. Indeed, it is likely 
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that in such a situation a relatively precise knowledge of all the low-lying modes 
is needed even for evaluating the fermion Green functions. This can be done by 
splitting each quark propagator into two parts: the hrst part, to which only the 
low-lying modes contribute, should be expressed in terms of the known eigenval¬ 
ues and eigenvectors; the second part, to which no low-lying modes contribute, 
should be evaluated by inverting the operator Q in the subspace orthogonal to 
the one spanned by the low-lying modes of Q^. 


4 Rounding errors in the PHMC algorithm 


Rounding errors may become in principle a problem for all simulation algorithms. 
Each algorithm is designed to produce held conhgurations according to a prob¬ 
ability distribution, related to the Boltzmann factor of a given Euclidean held 
theory. The danger is that when implementing a code for some hnite precision 
computer, rounding errors may render the probability distribution of the actually 
produced held conhgurations somewhat diherent from the desired one. In par¬ 
ticular, when using molecular dynamics kind of algorithms like the HMC or the 
PHMC algorithms, the equations of motion, integrated numerically by a symplec- 
tic integrator, lack in principle the reversibility condition, resulting in an inexact 
algorithm. It is still an open question for what situations this systematic error of 
the molecular dynamics kinds of algorithm will become important in practise. 
The problem of rounding errors ought to be studied especially for the PHMC 
algorithm. As the discussion in section 2.2 showed, for an efficient computation 
of 6Sp in the PHMC algorithm, the product representation of the polynomial 
Pn,e should be used. However, the stability of the numerical construction of 
the polynomial in the product representation depends strongly on the ordering 
of the monomial factors in eq. Particularly “bad” orderings easily lead to 
substantial precision losses or even numerical overflow. As demonstrated in 


(see also [^|), there exist, fortunately, orderings of the monomial factors (or 
equivalently the roots of the polynomial) such that rounding errors can be kept 
on a perfectly tolerable level. 

Still, the rounding errors appearing for a particular ordering scheme applied in a 
given situation should be monitored carefully. In the generation of the pseudo- 
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fermion fields (j) one could in principle resort to numerically stable recursion re¬ 
lations. However, as discussed above, it is very easy to monitor the rounding 
errors in this case by evaluating the difference of eq.(|T^. In the evaluation of the 
correction factor, on the other hand, it is in general advisable to use a recursion 


relation |^, ^ in order to obtain a sufficiently precise result. In the following 
we will hrst discuss the rounding-error effects appearing in the construction of 
Q‘^Pn,e{Q‘^)'- this is the polynomial of highest degree among the ones which occur 
in the PHMC update and all the different polynomials of lower degree are nu¬ 
merically constructed by using the same ordering of monomial factors. We then 
turn to a discussion of the magnitude of reversibility violations. 


4.1 Rounding errors from the product representation 


As shown in [^, the Clenshaw recursion relation provides a very stable and 
precise way to evaluate the polynomial Q^Pn,e{Q^) = 1 + Rn,e{Q‘^), even when 32 
bit precision is employed everywhere, but in internal products and other sums 
over the whole lattice. This gives us the possibility of evaluating the size of the 
rounding errors when the polynomial Q^Pn,e{Q‘^) is constructed in its product 
representation, eq. (0). Following we consider the vector 


^ order ^ QV^iQ ■ ■■■ ■ \/^(Q - ri)QI?G 


(38) 


where we have taken the preconditioned matrix Q as it was used in all our nu¬ 
merical tests. The label “order” can stand for a particular monomial ordering 
scheme. In the following we will only discuss the bit reversal and Montvay’s 
schemes, which were found in to be the most precise. We refer again to [Q 


for a dehnition of the ordering schemes employed here. Given the numerical sta¬ 
bility of the Clenshaw recursion, a good measure of rounding errors, when only 
32 bit precision is employed, is the quantity 


A order 


Vn 


1 $ 


order 


- $ 


Clenshaw 


(39) 


In table || we give the results for Aorder for the bit reversal and Montvay’s ordering 
schemes. All results have been obtained on an 8^ x 16 lattice with Schrodinger 
functional boundary conditions as used for the computation for the 0(a) improved 
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Table 3: The quantity Aorder eq. is given for the bit reversal (BR) and 
Montvay’s ordering schemes. 


n 

e 

5 

Abr 

AMontvay 

16 

0.0030 

0.310 

3.1(1)10-® 

3.0(1) ■ 10-® 

32 

0.0030 

0.054 

3.3(1)10-® 

3.0(1) ■ 10-® 

64 

0.0022 

0.0045 

4.5(1)10-® 

4.2(1) ■ 10-® 

100 

0.0022 

0.0002 

8.4(2)10-® 

6.4(2) ■ 10-® 

100 

0.0010 

0.0034 

9.0(2)10-® 

6.4(2) ■ 10-® 

100 

0.0005 

0.0218 

10.8(2)10-® 

7.8(2) ■ 10-® 


action 


m 


1.42511. 


The parameters of the runs were (3 


6.8, K = 0.1343 and Cgw = 


We have chosen the constants c^, eq. (0), to be all identical. Choosing the c^’s 
different from each other (while keeping hxed their product which guarantees 
the proper normalization of ), changes the results in table at most at 
the 10% level. The results of table ^ are qualitatively very similar to the ones 
reported in [^. They show a growth of rounding errors in the construction of the 
polynomial Pn,e as n and e~^ increase (see the behaviour for n = 100). However, 
the magnitude of rounding errors for the cases considered in table are perfectly 
tolerable. In particular, no evidence for numerical instabilities or large rounding 
error effects has been observed. Since all our simulations are performed using 
either the bit reversal or Montvay’s ordering schemes for a range of values of 
n and e covered by the ones given in table we conclude that our numerical 
simulations are safe against rounding errors coming from the use of the product 
representation for the polynomial Pn,€- 


4.2 Reversibility violations 

For the purposes of this section, the evolution of the gauge held in the molecu¬ 
lar dynamics part of the PHMC algorithm can be summarized as follows: some 
initial held conhguration of the gauge helds and their conjugate momenta 
{Pin,Tn}; IS evolved from a hctitious Monte Carlo time t = 0 to the hnal 
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configuration {Uend, TTend} at t = T, with T usually set to T = 1 in production 
runs. This evolution is determined by the “equations of motion”, derived from a 
Hamiltonian H = ^ + S, where S is the total action. At t = T, the con¬ 

figuration {t/end, TTend} IS subjcct to an accept/rcjcct step using the values of the 
Hamiltonians hfin and i^end, as measured on the initial and hnal configurations, 
respectively. 

We recall that in evolving the gauge field conhguration in the Monte Carlo time 
a great flexibility is allowed. The imposed restrictions are -from a practical point 
of view- that the acceptance rate determined by H^nd — hfin should be reasonably 
large, about 80%, and -from a principal point of view- that the evolution in the 
Monte Carlo time ought to be reversible in order to guarantee detailed balance 
and consequently the correct importance sampling. 

The method of choice for the Monte Carlo time evolution is to evolve the sys¬ 
tem with the equations of motion using a leap-frog integrator. It was found, in 
particular when machines with only 32-bit precision arithmetic are used, that 
due to rounding errors, violations of the reversibility condition are encountered. 
What is worse, it appears that the equations of motion correspond to those of a 
classical chaotic system with a positive Liapunov exponent ^1], |^, ^ |35[| . 

As a consequence, rounding error effects are exponentially amplihed along the 
integration of the equations of motion. 

Using a leap-frog integrator -in particular on an APE machine with 32-bit arith¬ 
metic as in this work- needs therefore an estimate of violations of reversibility. As 
it was discussed at length in [^, in the PHMC algorithm some orderings of the 
monomial factors in the product representation can lead to large rounding-errors 
effects with a possible strong influence on reversibility violations. We therefore 
checked the magnitude of the reversibility violations when using the subpolyno¬ 
mial, the bit reversal and Montvay’s ordering schemes as described in |^. These 
tests were performed with the same parameters as in Section 4.1. In particular, 
the polynomial parameters were chosen to be n = 64 and e = 0.0022. 

To measure the reversibility violations, we simply started from the final config¬ 
uration {Lend, TTend}, reversed the sign of the step size dt and integrated back to 
reach the reversed configuration {Urev, TTfev}- In all our tests we used the higher 
order leap-frog integrator as suggested in (i.e. eq.(6.7) of that reference with 
n = 4). Our step size was chosen to be dt = 0.05 for both the forward and the 
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backward integration and the valne of the trajectory length was T = 0.75. On 
the initial and the reversed conhgurations we measured the corresponding Hamil¬ 
tonians i7in and i7rev and the plaquettes Pin and P^ev averaged over the gauge 
conhguration. The difference of these quantities, dH, dP and the norm difference 
dll of the gauge links 


\\dUf 

dH 

dP 


II Pin - Pn 


I Pin - Hr, 




_ TTX,^,a,l3\2 

Pin ^vev I 


IP - P 

-tin -t rf 


(40) 


serve as our quantitative measure of the reversibility violations. In eq.(^0D the sum 
extends over the lattice points, the 4 forward directions and the colour indices. 


Table 4: Reversibility violations for the PHMC and HMC algorithms, comparing 
different root orderings for the PHMC algorithm, subpolynomial (SP), bit reversal 
(HR) and Montvay’s ordering scheme. HR* indicates that the roots are calculated 
in 64-bit arithmetic. 


Scheme 

(IldPlI) 

{dH) 

SP 

9.45(1) ■ 10-® 

2.1(2) ■ 10-2 

HR 

1.293(1) ■ 10-6 

4.0(9) ■ 10-3 

HR* 

1.292(1) ■ 10-6 

2.8(8) ■ 10-3 

Montvay 

1.277(1) ■ 10-6 

3.4(9) ■ 10-3 

HMC 

6.7(2) ■ 10-^ 

1.4(6) ■ 10-3 


Our results, averaged over 32 conhgurations are given in table ^ for the subpoly¬ 
nomial (SP), the bit reversal (HR) and Montvay’s ordering scheme. We compare 
with the corresponding results from the HMC algorithm, using there the same 
number of steps and an equal step size as used in the case of the PHMC algorithm. 
For the HMC algorithm, in the Conjugate Gradient solver we have chosen a stop¬ 
ping criterion requiring that the squared norm of the residual vector, normalized 
by the solution vector, be less than e™*" = 

One clearly sees that the subpolynomial scheme gives substantially more re¬ 
versibility violations than the one encountered in the HMC algorithm. Within 


31 









the errors, the size of the reversibility violations from the PHMC algorithm with 
the bit reversal and Montvay’s scheme are of the same order as the ones from the 
HMC algorithm. We also considered the bit reversal ordering in the case, denoted 
as BR*, when the roots and the normalization factors are compnted with 64-bit 
precision and then read in. Within the errors we do not hnd any effect. In onr 
tests we could find no difference in the plaquette expectation value. We conclude 
that our results are not contaminated from reversibility violation effects. 

As mentioned above, there is a lot of flexibility to perform the evolution of the 
gauge field configuration in Monte Carlo time in the molecular dynamics part of 
the HMC algorithm. The PHMC algorithm establishes an approximation of the 
exact evolution. The crucial advantage of the PHMC algorithm is, of course, that 
this approximation is fully controlled and can be corrected for. Another possi¬ 
bility of approximating the Monte Carlo time evolution is to just use a larger 
stopping criterion for the inverter of . However, we think that the reversibil¬ 
ity violations, which are certainly present, may then become dangerous: due to 
rounding errors, when integrating the gauge helds backward in time, the inverter 
“sees” a different gauge field configuration from the one during the forward in¬ 
tegration. Therefore also the solution vectors will be different and when the 
stopping criterion is relaxed, this difference is enhanced, possibly leading to large 
reversibility violations. 

This effect is amplihed when one makes use of the “knowledge of the past”: the 
inverter is started with an initial guess, which is the solution of the previous 
inversion. This reduces the number of iterations in the inverter, since generally 
the movement of the gauge helds through conhguration space is smooth. The 
idea may also be iterated 1^^. However, in this way potential rounding errors are 
amplihed, since they are accumulated in the solution vectors. 

A possible solution may be to choose always a constant starting vector. Then 
reversibility violations only appear through the diherence in the gauge held conhg¬ 
uration. We tested this possibility for our implementation of the HMC algorithm 
and report our results in table A similar investigation has been performed in 
[p4[ . Here we have taken the same parameters as used for table averaging again 
over 32 conhgurations. 

As can be seen, already for a stopping criterion of 10“^° the reversibility violations 
are substantially larger than for the severe stopping criterion, showing even a 
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Table 5: Comparison of reversibility violations in the HMC algorithm using a 
constant starting vector in the Conjugate Gradient solver. 


.HMC 

^stop 


iimi) 

{dH) 

m 

1.0 ■ 

10- 

-14 

6.58(1) ■ 10-^ 

1.1(6) ■ 10-3 

— 

1.0- 

10- 

-12 

8.5(1) ■ 10-^ 

1.8(7) ■ 10-3 

— 

1.0- 

10- 

-10 

2.6(4) ■ 10“® 

7.8(1.8) ■ 10-3 

6(3) ■ 10-3 

1.0- 

10- 

-8 

4.7(3) ■ 10-5 

2.0(3) ■ 10-3 

6(1) ■10-^ 


difference in the expectation value of the plaquette. We conclude that on machines 
with 32-bit precision arithmetic the stopping criterion can not be relaxed too 
much. Since with a constant starting vector we loose the advantage of having 
a reasonable hrst guess for the solution of the inverter, we prefer using a severe 
stopping criterion and a better initial guess over relaxing the stopping criterion 
and using a constant starting vector. Of course, the situation might look different 
on machines with higher precision, where reversibility violations are suppressed. 
We would like to point out a second effect relevant for reversibility violations. 
When the stopping criterion is made large, it might happen that during the 
backward integration the inverter stops one iteration before or later than on the 
corresponding step in the forward integration. Since now the stopping criterion 
is large, the solution vectors are very different, leading to large reversibility vi¬ 
olations. The only way to overcome this would be to also keep the number of 
iterations constant. However, we feel that with this way of accelerating the algo¬ 
rithm the convergence of the inverter is not very well controlled, but we have not 
studied this situation in detail and we do not know how a possible poor conver¬ 
gence may affect the acceptance of the whole molecular dynamics trajectory. 

5 Memory requirements 

In this section we wish to discuss three possible ways of reducing memory re¬ 
quirements. The hrst two ways (sections 5.1 and 5.2) result in some reasonably 
tolerable computational overhead. The last way (section 5.3) was already shortly 
mentioned in 0 but it leads to a signihcant alteration of the dynamics. 
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Once again, we neglect in our discussion the technical complications arising 
from the use of even-odd preconditioning, which can however be treated as in 
any HMC-like algorithm. We recall here that the pseudofermion helds (j) and 
Xk, k = 1,2 ,..which will enter our discussion, are assigned to arrays dehned 
on all lattice sites. Indeed, even if only the odd-site components are needed in 
principle, we have found it convenient to make use of the even-site components 
to store intermediate results in the construction of the operator Q (see eq. (35) of 
[P^), which connects second neighbour sites on the original lattice. 

5.1 PHMC update with {Q‘^ — Zk) monomials 

It is, of course, possible to implement the PHMC algorithm by using also the 
product representation 

n 

PnM) = X{[Ck{Q‘^ - Zk)] (41) 

k=l 

with the roots Zk given in eq.(^. The variation of the action Sp then becomes 

njl 

ssp = Y.Ck 

k=l 

where the auxiliary pseudofermion helds ^k, for k = 1,... ,n — 1 are given by 

6 = [ck{Q^ - Zk)] ■ [ck-i{Q^ - Zk-i)] ■ ... • [ci((5^ - Zi)](j) . (43) 

Following the discussion in section 2.2, the evaluation of 6Sp in eq. (|4^) implies 
a memory requirement of only {n/2) + 2 pseudofermion helds, which means a 
reduction of basically a factor 2. 

However, if one insists on using only (n/2) + 2 pseudofermion helds, it seems 
impossible to avoid an overhead on the computational cost. In evaluating 6Sp 
in eq.(^2D one needs, analogously to the case discussed in Section 2.2, 3n Q<p op¬ 
erations. There appear, however, additional n/2 Qcj) operations for the following 
reason. When storing only the helds ^i,..., Cn /2 before starting the computation 
of 6Sp, one is “loosing” the information about the vectors Q^i ,..., Q^n/ 2 - 1 , which 
have already been calculated as intermediate steps in the evaluation of the aux¬ 
iliary helds ^ 1 ,..., Cn/ 2 - This information is needed since 6Q^ = [SQ]Q + Q [6Q]. 
A similar problem with the vectors Q^i for I = n/2,...,n — 1 can be avoided by a 


6Q^ a-l ® d-k + ^n-k ® d-, 


(42) 
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prudent usage of memory space associated with the vectors As a consequence, 
with respect to the method described in Section 2.2, when using the product 
representation of eq. f^ID , the memory reqnirements are basically halved at the 
price of an increase of the compntational cost of SSp, which can be estimated to 
be about 15-20%. 

5.2 Flexible trading between memory requirement and 
CPU time 

It is clear that in implementing the evaluation of 6Sp, eq. (0), one can trade off 
CPU time with memory space in different ways. We sketch here the basic idea 
of a flexible method for compromising between memory and CPU time, which 
we have fonnd very convenient in practical simulations. For simplicity we take 
the example of a polynomial, Pn^eiQ"^), eq.(0), with n = 100 and consider a non- 
optimized version of the method that we use in practice. A fnlly general and very 
detailed technical discussion of this method and its performance is deferred to 
Appendix B. 

A significant fraction of the memory is usually taken to store the gange links, 
their conjngate momenta, some psendofermion vectors, as needed for the fermion 
matrix inversion, and the dynamical pseudofermion held, 0, extracted from a 
probability distribution oc exp Let us imagine to divide the re¬ 

maining storage space (assnmed to be mnch less than what is needed for storing 
n = 100 pseudofermion vectors) into three sectors. In this particular case, with 
n = 100, the hrst and the second sector will contain 9 and the third sector only 
2 psendofermion vectors. It is clear that the third sector can be used as working 
space for fermion matrix times vector mnltiplications, where neither the initial 
vector nor the hnal one need to be stored elsewhere. 

We have already observed in Section 2.2 that the variation 5Sp, eq. of the 
pseudofermion action S'p, is a sum of n terms. Each term depends only on two 
auxiliary helds, Xj-i) X 2 n-j (and their complex conjugates), where j is the index 
over which the sum runs and the anxiliary vectors Xk are dehned in eq.(0). For 
the evalnation of 6Sp one can then proceed as follows. 

In a preliminary step, starting from 0, we constrnct the anxiliary vectors Xi, X 2 ^ 
• • •, X 89 , X 90 , and store only 9 vectors, namely xio, X 20 , ■ • •, Xso, X 90 , in the part 
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of the memory that we have indicated above as the hrst sector. 

Then, in a first step, starting from the saved vector X 90 , we construct the auxiliary 
vectors X 9 i, X 92 , ■ ■ ■, X 98 , X 99 and store all of them in the second sector. We are 
now in position to evaluate the first ten contributions to 6Sp, namely the ones 
corresponding to j = 100, 99,..., 91 in eg. ([T0|) . The point is that fermion matrix 
times vector multiplications can be performed in such a way that the third sector 
is employed to store in turn hrst Xioo and Xioi; then xioi and X 102 , and so on, up 
to X 108 and Xi 09 - 

In the second step, starting from the saved vector ygO) we construct the auxiliary 
vectors xsi, X 82 , • • •, X88, X 89 and store all of them in the second sector. We are 
now in position to evaluate further ten contributions to 6Sp, namely the ones 
corresponding to j = 90, 89,..., 81 in eg. (pT[) , making use of the third sector to 
temporarily store the various pairs of auxiliary vectors between xiw and Xii 9 i as 
explained above. 

Proceeding in an analogous way, we can evaluate in ten steps all the contri¬ 
butions to 5Sp. Notice that in each of these steps, except the hrst one, nine 
pseudofermion vectors, which had been computed and immediately overwritten 
during the preliminary step mentioned above, are computed again. This leads to 
a global computational cost, which is eguivalent to about 390 Q(j) operations, to 
be compared with the cost of about 300 Q0 operations needed for the method 
discussed in section 2.2 (for a single evaluation of 6Sp). This increase of the com¬ 
putational cost is just the price to be paid for evaluating 6Sp in the case n = 100 
by using only 20 (instead of 100) auxiliary pseudofermion vectors. A similar re¬ 
sult, with somewhat better compromise between memory and CPU time, is found 
for any value of the degree n of the PHMC polynomial when using the generalized 
version of this method which is described in Appendix B. 

Let us conclude with a general remark about the well-known problem of large 
memory reguirements, which is in principle common to all algorithms for dynam¬ 
ical fermions relying on a polynomial approximation of some negative power of 
the Dirac operator. The method presented in this section clearly shows that this 
problem, even for a polynomial of very high degree n, is in practice much less 
severe for the PHMC algorithm than for the multiboson algorithm. This is a 
conseguence of the fact that the number of dynamical pseudofermion fields is n 
in the multiboson algorithm and only one in the PHMC. This allows in the latter 
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case for a balance between the conflicting reqnirements of minimizing the nnmber 
of anxiliary psendofermion fields and maximizing the compntational efficiency. 

5.3 Introducing more psendofermion fields 

The last method of redncing memory reqnirements that we have stndied amonnts 
to introdncing more psendofermion fields and distribnting the monomial factors 
of the polynomial Pn,e{Q‘^) among them. Let us consider the action 

m 

sy’ = sy'i.#.,!/] = ■ (^) 

In eq. (|4^) we have introduced m positive definite subpolynomials P^ne-miQ) 
of degree 2n/m such that their product yields In this way, one has to 

have memory space for only m + n/m psendofermion fields in practise and hence 
would significantly reduce the memory requirements. 

However, it is clear already at this stage that by changing m the dynamics of the 
algorithm will change: for m = 1 we recover our PHMC algorithm. For m = nwe 
are in the case of the original multiboson algorithm and would have an increase of 
the autocorrelation time with n. It might be hoped, however, that by choosing m 
small enough, the dynamics is not changed too much and that in this way again 
a reduction of memory requirements can be achieved. 

It should be emphasized that when using the action eq.(P^ special care has to be 
taken for the ordering of the roots in order not to generate unwanted effects that 
come purely from rounding errors. Without going into detail here, we note that 
by using e.g. the bit-reversal scheme, a suitable ordering of the roots avoiding 
rounding-error effects can be obtained. In addition, we checked that by running 
the program with 64-bit precision our results, quoted below, did not change. 

We have done several tests for different choices of m in the range m G [2,10]. 
We report our results obtained in the SU{2) gauge theory with two flavours of 
dynamical Wilson fermions. We set Csw = 0 and take a lattice of size 8^ x 16 
with periodic boundary conditions. We have considered two choices of the bare 
parameters, /3 = 2.12, k = 0.15 and /3 = 1.75, k = 0.165, using the subpolynomial 
and the bit-reversal ordering schemes of monomial factors. 

The effects for different choices of m should most clearly appear in the Hamilto- 
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(46) 


1 


H =5 +Sjic'i+ st'% u] , 


X,jj. C=1 


used in the molecular dynamics part of the PHMC algorithm. In eg .(1451) vr^ 
denote the momenta conjugate to the gauge helds. We monitored the differences 
between the initial and hnal Hamiltonians in a molecular dynamics trajectory. For 
all parameters considered in table ^ we always started from the same thermalized 
gauge held conhguration and kept the step size dt = 0.04 and the number of steps 
Nstep = 10 hxed. 

In table § we give our results for the differences of the initial and hnal Hamilto¬ 
nians hfend — hTin, of the gauge links \\Uend — fhnil and of their conjugate momenta 
(llTTend “ TTinU), as measured at the beginning and at the end of a trajectory. The 
dehnition of ||t/end — flinP is analogous to the dehnition of ||17in — HrevP, eg. (PUD , 
with a normalisation factor of because the gauge group is now SU{2). 

Finally, we dehne 

1 


"^end 


TTin — 


12V 


i(yw).nd - (yw)i 


(46) 


X,fl,C 


Table 6: The diherences of the initial and hnal values of the Hamiltonian, the 
momenta and the gauge links. Results are obtained on a 8^ x 16 lattice at 
(3 = 1.75, K = 0.165, in the SU{2) gauge theory. 


n 

e 

(m, order) 

-^end Hiyv 

11 ^end Tn 11 

llflend - f/iniP 

64 

0.0015 

(1, HR) 

0.63 

0.301 

0.0657 

64 

0.0015 

(1, SP) 

0.63 

0.301 

0.0657 

64 

0.0015 

(8, HR) 

28.1 

1.170 

0.0357 

64 

0.0015 

(8, SP) 

40.9 

1.158 

0.0354 

64 

0.0005 

(1, HR) 

1.33 

0.310 

0.0655 

64 

0.0005 

(8, HR) 

101 

0.856 

0.0222 


As the results shown in table ^ indicate, the behaviour of the molecular dynamics 
part of the algorithm looks such that in the case with m = 8 one typically gets 
larger time discretisation ehects. This is clearly seen by the values of i7end — 

At the same time, the diherence in the momenta HvTend — TTinll becomes larger. 
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too, while the difference in the gauge links ||f/end — h^in|| becomes sma//er than 
in the case m = 1. This might be explained by the fact that the gauge links 
are always normalized to SU(2) matrices and that they counteract the behaviour 
of the momenta to render the difference hfend — hfin small. The results depend 
also on the distribution of the monomial factors among the subpolynomials 
eq.(^), as the comparison between the bit-reversal and the subpolynomial cases 
shows. When reducing the value of e we again find even more different results, as 
shown by the last two lines of table Tests performed with gauge group SU(3) 
and Schrodinger functional boundary conditions revealed a similar qualitative 
behaviour. 

We conclude that, in order to get a reasonable acceptance rate in the cases with 
m > 1, one is forced to reduce the value of dt substantially, resulting in a higher 
cost of a simulation. It seems to us that the case m = 1, i.e. the PHMC algorithm 
is most efficient. Of course, we cannot exclude that there are other possibilities 
of choosing subpolynomials that give a reduction of memory requirements and 
do not worsen the dynamical behaviour of the algorithm. On the other hand, the 
solution to the problem of memory requirement discussed in section 5.2 appears 
to be already satisfactory. 

6 Conclusions 

We gave in this paper a detailed description of the PHMC algorithm, which relies 
on a combination of the HMC algorithm and the multiboson technique to simulate 
dynamical fermions |^, Q . We discussed the computational cost of the algorithm, 
checked that rounding-error effects that can appear are under control and showed 
possible ways to reduce memory requirements. 

The effects of the correction factor that is introduced to render the algorithm 
exact, was studied in detail. Special emphasis was put on the fact that the 
PHMC algorithm samples the configuration space very differently compared to 
the most commonly used HMC algorithm. In particular, some evidence was given 
that the region of gauge configuration space characterized by the presence of low 
lying modes of is explored much better when using the PHMC algorithm. 

Of course, it is important to compare the performance of the PHMC algorithm 
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with the one of the HMC algorithm. The work presented here lays the ground 
for such an investigation of the performance of the PHMC algorithm on which 
we will report in a separate publication [^. There we will also show further 
evidence that the PHMC algorithm samples conhguration space differently from 
the HMC algorithm and discuss consequences for physical observables. 
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A Derivation of eq.(|S3|) 


We start again from the nj 


2 lattice QCD partition function, eg. (^ID : 


Z = J VUV(j)^V(^Vr]^V7]WB[n,U]WiR[U]e-^^o+^^+^^^ 

Sp = 5p[[/,0] = (/.tp„,,(g2[f/])0 
Srj = V^V ■ 


(47) 


The splitting of the original correction factor W into two parts, an “infrared” 
part, 

Wir[U]= n [1 + ^n,. (A,)] =det(L„,,, ,[[/]), (48) 

Aj<e 

and a “bulk” part, 

Wsiv, U] = exp ^r]^[l - Ln,e,e ■ (g^ • Pn,e)~^]v} > (49) 

follows in a natural, unbiased way from the introduction of an operator which 
acts on pseudofermion helds and depends on e, n, e and the gauge conhguration 
U: 

L = = 1 + ^ |Ag(A,|P„,,(A,)0(e - A,) . (50) 

j 
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Since the index j runs over all the eigenvalues of the operator L, which can be 
diagonalised simultaneously with has eigenvalues given by 1 + Rn,e{^j) for the 
modes of with Xj < e and by 1 otherwise. Due to the properties of the relative 
fit error function Rn^e{X) = \Pn,e{X) — 1, the operator L is Hermitean and strictly 
positive, if all the X/s are strictly positive: a zero mode of L appears only in one- 
to-one correspondence with a zero mode of Q^. In particular, the operator L has 
exactly the same infrared behaviour as the operator Q^Pn,e{.Q^) = ^ + Rn,e{Q^), if 
modes with Xj < e are present. However, because of the 6 functions appearing in 
its definition, L is not a smooth functional of the (lattice) gauge field, in contrast 
with the operator Q^Pn,e{Q^)- 

It is then straightforward to show that eq.(|^ can be rewritten in the form of 
eq.(^) by introducing the pseudofermion field vector \ri±) as in eq. (pH) . 

B Optimizing memory requirements 

We present here a general and flexible method for some optimal trading of CPU 
time with memory requirement in the implementation of the PHMC algorithm, 
which turns out to be very convenient in practical simulations. 

Suppose that we wish to use the PHMC algorithm with the polynomial Pn,e{Q‘^)^ 
eq. ( 0 ), where only the degree n is relevant for the present discussion, following the 
implementation described in Section 2.2. Suppose also that the lattice size and 
the memory capacity of our computer are such that, in addition to gauge helds, 
their conjugate momenta and other working arrays, only + 1 pseudofermion 
helds can be stored. One of these must necessarily be the “dynamical” held 0 
extracted from the probability distribution exp [—so we are left 
with the possibility of using at most N auxiliary pseudofermion helds during 
the evaluation of 5Sp, eq. m- Since for iV > n it is possible to use the method 
described in Section 2.2, we consider here only the case N < n, which corresponds 
to the situation of a relatively small storage capacity. 

We have already observed in Section 2.2 that the variation 6Sp, eq. o. of the 
pseudofermion action Sp, is a sum of n terms. Each term depends only on two 
auxiliary helds, Xj-i) X 2 n-j (and their complex conjugates), where j is the index 
over which the sum runs. We remark that in evaluating 6Sp it is convenient to 
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compute first the term in the sum with j = n, then the one with j = n — 1, 
etc., down to the last with j = 1. Indeed, in this way, the auxiliary fields Xh 
with I > n are required in the following, natural order: hrst Xn, then Xn+i, etc., 
up to X 2 n-i- Given this situation, the basic idea of our method is to divide the 
available storage space for N auxiliary pseudofermion helds into three parts: 

• (a) A fixed storage part, where only M of the auxiliary fields Xi, ..., Xn 
should be stored; let us denote them hy Xh, ■ ■ ■, Xim- 

• (b) A first working space part, where K pseudofermion fields can be stored; 
this storage space should be large enough to construct Xn-i starting from 
XiM, as well as Xim-i starting from Xi„^-l, for all m = 1, 2 ,..., M - 1. 

• (c) A second working space part, where 2 pseudofermion helds can be 
stored; this allows, for any given I > n, to construct and store the held 
Xi+i) starting from the held Xh while keeping it stored, too. 

The relation M + K + 2 < N must of course be satished^. This relation and 
the above requirement about the size of the part (b) of the storage space impose 
restrictions on the possible choices of the integers M and K, as well as of the 
set of integers Im = {A, A, • • •, *m}, satisfying A < A < • • • < For the 

moment, let us assume that a choice of M, K and Xm exists which satishes our 
requirements; we discuss below some examples and their practical performance. 
It is clear that under these assumptions the evaluation oi hSp can be performed 
following the strategy sketched in the steps below. 

• In a preliminary step, starting from xo = A, construct all the auxiliary helds 
Xj, with j < iM and store only Xh, ■ ■ ■, Xim ^Fe sector (a). 

• Set iM+i = n, io = 0. Then go through the following recursive pro¬ 
cedure, where s is an integer labelling the steps of the recursion: s = 
1 ,2 ,...,M,M+1 . 

• For a given value of s, let us dehne the auxiliary integers p = iu+i-s and 
q = iM+ 2 -s- Then the step s can be described as follows. Starting from Xip 

^We leave for the moment the freedom of using only part of the available memory, i.e. taking 
M + K + 2<N. 
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construct the fields Xip+ii ■ ■ ■ i Xiq-i and store them in the sector (b). Then 
evaluate the contributions to 5Sp, eq. (pT[) , with j = ig, ig — 1 ,..., ip + 1 
(just in this order), using the sector (c) to construct and temporarily store 
in turn the relevant auxiliary fields X 2 n-iq, X 2 n-i,+i, • • •, X 2 n-ip-i- 

It may happen that only part of the sector (b) has to be used in the step with 
s = M + 1. It is also important to remark that in the steps with s > 1 a new 
evaluation of the auxiliary fields Xj is required, for all values of j < im and not 
belonging to Tm- Indeed, these auxiliary fields were already constructed and then 
overwritten during the preliminary step mentioned above. The CPU time needed 
for their recomputation in the M steps with s > 1 represents the price to be paid 
for computing SSp using a number of auxiliary fields less than the degree of the 
PHMC polynomial. On the other hand, no such recomputation occurs in the step 
with s = 1. 

Let us come now to the determination of M, K, and Im as functions of n and 
N. We recall that the chosen values of M and K must satisfy: 

M + K + 2<N, (M+l)(iC +1) > n , (51) 

where the second condition is equivalent to the above requirement on the size 
of the part (b) of the storage space. From the description of our strategy for 
computing the variation of Sp, it should be clear that this condition guarantees 
that all of the n terms appearing in eq.(|T0|) for 6Sp can actually be evaluated. 
For any choice of M and K compatible with eq.(^lj), the set of integers Im can 
be defined as follows: 

im = n — {M — m + 1){K + 1) , m = 1, 2,..., M . (52) 

On the other hand, with respect to the simple method of Section 2.2, the com¬ 
putational overhead, due to the need of evaluating twice some of the auxiliary 
fields, is given in units of Q<p operations by: 

Cextra = Im — M = U — 1 — {K + M) > H — N + 1 . (53) 

The optimal choices of M and K are the ones which minimize Oextra, he. maximize 
M + K, compatibly with eq. (0- this amounts to saturating the bound M + K + 
2 < N and yields Oextra = n — N + 1. In table (|^ we illustrate the performance 
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of this method for evaluating 6Sp in some typical cases. Moreover, that table 
also contains results obtained by using a modified version of the method, which 
is useful in cases when a very limited storage space is available. 

This modified version relies on the fact that it is not strictly necessary to keep 
constant, during the various steps of the computation of 6Sp, the size of the parts 
(a) and (b) of the storage space. Indeed, we may only require that the parts (a) 
and (b) have a constant global size, measured by the sum M + K. We will see 
that the freedom of varying the size of the single parts (a) and (b) allows for 
reducing the minimal storage required for the computation of SSp. 

For instance, after the step with s = 1 the auxiliary field Xim is no longer needed; 
which means that in the step with s = 2 we may take the parts (a) and (b) to 
have size M — 1 and K + 1, respectively. For the same reason, after each step 
we may decrease by one unit the size of the part (a) and increase by one unit 
the size of the part (b), ending with a part (b) of size K + M in the step with 
s = M + 1. It is then clear that, with a suitable dehnition of the integers in 
Im, the computation of 6Sp can be performed following exactly all the steps of 
our method, as explained above. Of course, the meaning of the integers M and 
K is now different: they only give the size of the parts (a) and (b) during the 
preliminary step and the step with s = 1. 

The conditions to be fulfilled by the admissible choices of M and K, as functions 
of n and iV, are also modified, as well as the corresponding definition of Im- 
While the condition M + K + 2 < N remains obviously valid, the definition 
(|5^ and the second condition in eq. (|^ should be replaced, respectively, by the 
recursive dehnition: 


*M+i — n 

iM+i-i = hi+i-.-s-K, s = 1,2,..., Af + 1 (54) 

(55) 


and by the condition 

zi < iF + M + 1 . (56) 

On the other hand, the computational overhead with respect to the simple method 


of Section 2.2 can be shown to be still given by eq. (|5^ . As a consequence, the 
optimal choices of M and K are the ones that maximize the sum M + K and 
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are compatible with the new conditions necessary for the fnll evalnation of 6Sp. 
In the following we will refer to the two versions of the method presented in this 
section, the one with hxed size of the parts (a) and (b) of the memory and the 
one with variable size of the same parts, as to the basic and the modified version, 
respectively. The two versions are compared in table (^, for a few valnes of n 
in the range relevant for cnrrent day simulations using the PHMC algorithm. In 
order to give an idea of the criticality of the simulations, we also quote the typical 
condition number of denoted by k{Q‘^), for the values of n considered. 

For each value of n , we consider different values of N\ the minimal one {N = 
needed for using the modified version, the minimal one {N = Nmin) needed 
for using the basic version, iV = n/2 (in order to compare with the method of 
Section 5.1) and N = n — In table (|^ a prime is used to denote the 

quantities relative to the modified version of the computational method under 
study. 

From table (|^ we can see that the method presented in this section indeed enables 
one to save a signihcant amount of storage space at the price of a very moderate 
computational overhead. Namely, memory requirements are reduced by a large 
factor, which increases with n and is about 5-i-9 for the considered values of n. On 
the other hand, the relative computational overhead, as measured by Oextra/Sn, 
eq.(^), increases very slowly with n, approaching asymptotically the value 1/3. 
It is also important to remark that these numbers refer to the maximal memory 
saving that can be achieved by the method. However we can see from table (|^) 
that for each value of n many other choices of N are allowed, which correspond 
to a different balance between memory saving and computational efficiency. Such 
a ffexibility makes it very easy to optimize the balance between memory saving 
and computational efficiency in any simulation setup, which represents a clear 
advantage of the method presented in this section in comparison with the ones 
discussed in Sections 2.2 and 5.1. 

Let us come now to the comparison between the basic and the modihed version 
of the method presented in this section. The latter version turns out to be 
more effective than the former in saving memory, as expected: we always hnd 
^min — -^min In table (|^), where the dots stand for cases when the basic version 
does not allow for a full evaluation of 6Sp. However, for values of N allowed in 
both versions, there is no difference in the computational overhead between the 
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Table 7: Performance for the two versions of the method of Section 5.2 for com¬ 
puting 6Sp, eg. (pup . The notation is dehned in the text: (Mq, Kq) denote, in the 
basic version of the method, the choice of (M, K) that minimizes, for given n and 
N, both Cextra and M itself. The corresponding primed quantities are relative to 
the modihed version. Note that the minimal value of Cextra corresponding to the 
given values of n and N is always realized. 


n 

k{Q^) 

N 

{Mo,Ko) 

extra 

C extra 

(M; Ao) 

C' 

^extra 


50 

770 

11 




(7.2) 

40 

0.267 

50 

770 

15 

(4.9) 

36 

0.240 

(7.2) 

40 

0.240 

50 

770 

25 

(2,21) 

26 

0.173 

(2,21) 

26 

0.173 

50 

770 

39 

(1,36) 

12 

0.080 

(1,36) 

12 

0.080 

100 

1470 

15 




(11.2) 

86 

0.287 

100 

1470 

20 

(9,9) 

81 

0.270 

(6,12) 

81 

0.270 

100 

1470 

50 

(2,46) 

51 

0.170 

(2,46) 

51 

0.170 

100 

1470 

85 

(1,82) 

16 

0.053 

(1,82) 

16 

0.053 

180 

4760 

20 




(14,4) 

161 

0.298 

180 

4760 

27 

(11,14) 

154 

0.285 

(7,18) 

154 

0.285 

180 

4760 

90 

(2,86) 

91 

0.169 

(2,86) 

91 

0.169 

180 

4760 

160 

(1,157) 

21 

0.039 

(1,157) 

21 

0.039 


two versions. Finally, we remark that for given values of n and iV, there are several 
choices of M and K that yield the minimal overhead (Cextra = n — N + 1) and are 
compatible with all the necessary conditions specihed above. At this level, some 
further differences between the two versions appear, which are however irrelevant 
in practice. In both versions, in particular, it is not possible to choose a too small 
value of M, for given values of n and N, if all the necessary conditions are to be 
fulhlled and the minimal overhead is to be achieved: the table shows also the 
lowest allowed values of M, and the corresponding ones of K, for the different 
cases. 
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